Final Project - Motor Vehicle Collisions - Crashes

Addressing SDG 3.6 Death Rate due to Road Traffic Injuries.

Table of contents:
  1. Motivation
  2. Basic stats. Let's understand the dataset better.
  3. Data Analysis.
  4. Genre.
  5. Visualizations.
  6. Discussion.
  7. Contributions. Who did what?

Part 1: Motivation

This section clarifies the motivation behind the project, what we are going to investigate, and how we are going to proceed.

The past couple of years have not been the best for any of us. The COVID-19 pandemic wreaked havoc, increasing human suffering, destabilizing the global economy and upending the lives of billions of people around the globe. Through these troubling times, the importance of health and well-being, and its larger impact on the global machinery and people's day-to-day lives, became clearer to everyone.

As a team we wanted to build a truly impactful project that would provide meaningful insights and help people make effective changes in their lives, so we decided to delve deeper into the subject of health and well-being. After a lot of brainstorming, we decided to analyze the Motor Vehicle Collisions - Crashes dataset from New York.

The Motor Vehicle Collisions dataset and why we want to analyze it

Road accidents are responsible for 1.3 million deaths and 50 million injuries annually worldwide. Road crashes are also the leading killer of children and young people aged five to 29. The COVID-19 pandemic killed about 6 million people in about two years; the reason we were able to gain some control over the disease was the diligent effort of scientists and doctors to better understand it, and through this gained knowledge we were able to defeat the pandemic. The same needs to happen with global road safety. The pain of families losing their loved ones is excruciating, and we need to put in the effort to better understand road safety so that everyone's loved ones return safely home from their respective journeys.

This was our drive to work on this project, as we all feel that this is a most pointless way to die. This project addresses Sustainable Development Goal 3, "Good Health and Well-being", more specifically target 3.6: reduce road injuries and deaths. As things stand, road accidents are set to cause an estimated further 13 million deaths and 500 million injuries during the next decade, according to the UN. Road accidents are entirely preventable, and our priority is to investigate their causes so that preventive measures can be implemented.

There are also countless other costs of car collisions (e.g. disrupted traffic flow, wasted resources and fuel, loss of life). Other than homicides, the fatal incidents through which police have the most contact with the public are fatal traffic collisions. So there is huge potential in building a traffic safety model that can help detect causes of collisions early and implement systemic measures to prevent these unfortunate events.

Goals for the end user's experience:

The goal for the end user's experience in this analysis is, first of all, to give an in-depth understanding of the data. Most people have expectations and assumptions about these patterns, like the most accident-prone hours, night driving, bad weather conditions, driver negligence, etc. We therefore want to provide a detailed overview of how the data is distributed along different time perspectives.

In particular, we want to show how the distributions have changed over the years, especially contrasting them with the COVID-19 year (2020), when road activity significantly decreased. This could also provide insight into accidents from a population density perspective, where many expect to see more accidents per capita in more crowded regions.

Moreover, we want to show to what extent we can predict the accidents: their number, how many people die, where the next accident happens, and the most relevant contributing factors. Finally, we want to give the user an understanding of how the weather impacts these patterns, as an interesting perspective on the subject.

The dataset was extracted from NYC OpenData. It is 384 MB in size, with 1.88M rows and 29 variables.

In Part 2: Basic stats, we will learn more about our dataset and get familiar with it.

Part 2: Basic stats. Let's understand the dataset better.

2.0 : Setup - Load Libraries

Setup - Helper functions

2.1: Load the data

2.2: Data Cleaning and Preprocessing

The Motor Vehicle Collisions crash dataset contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage.

The dataset has 29 columns for each data point, mainly: CRASH DATE, CRASH TIME, BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED, CONTRIBUTING FACTOR VEHICLE 1-5 (five columns of reasons for the accident), COLLISION_ID (unique record id), and VEHICLE TYPE CODE 1-5 (vehicle types involved in the crash, like bicycle or car/SUV).

Let's look at some samples of the dataset to get a better understanding of the data.

As you can see, there are a lot of NaN values in columns like CROSS STREET NAME and OFF STREET NAME. Since we don't use these columns in our analysis, and only use latitude, longitude and borough (district) for the geographical analysis, we don't need to delete the rows with NaNs there; the columns are dropped instead. The columns being dropped are 'ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME', and 'OFF STREET NAME'. Since we are keeping latitude, longitude and borough, rows with NaN values in those columns are removed.

The dataset has data from 2010 to May 2022. In order to reduce the size of the data, use the most relevant recent information, and contrast road data of non-Covid years with Covid years, we filter for the years 2018 to 2021. We leave out 2022 too, as it is an incomplete data year. We have also converted CRASH DATE to a datetime datatype to make processing easier.
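The filtering step can be sketched as below; this is a minimal example on a toy dataframe, assuming the raw data uses the column names listed above (the helper name `filter_years` is ours, not from the notebook).

```python
import pandas as pd

def filter_years(df, start=2018, end=2021):
    """Parse CRASH DATE as datetime and keep only crashes in [start, end]."""
    df = df.copy()
    df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"])
    return df[df["CRASH DATE"].dt.year.between(start, end)]

# Toy rows standing in for the real 1.88M-row dataset
crashes = pd.DataFrame({
    "CRASH DATE": ["05/14/2010", "07/01/2019", "03/20/2020", "01/02/2022"],
    "BOROUGH": ["QUEENS", "BROOKLYN", "MANHATTAN", "BRONX"],
})
recent = filter_years(crashes)
print(len(recent))  # 2: only the 2019 and 2020 rows survive
```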

As you can see, there are 5 CONTRIBUTING FACTOR and 5 VEHICLE TYPE CODE columns, one pair for each vehicle involved in the accident. Let's look at the number of missing values in these columns to better gauge the number of vehicles usually involved in an accident.

As can be seen above, the number of rows with at least one NaN in the contributing factor columns and the number of rows where CONTRIBUTING FACTOR VEHICLE 5 is NaN are the same (553736). Also, the number of NaNs decreases as we go from CONTRIBUTING FACTOR VEHICLE 5 to CONTRIBUTING FACTOR VEHICLE 1. This means that police officers fill in these columns from 1 to 5 depending on the number of vehicles involved in the accident. The same applies to the VEHICLE TYPE CODE columns.
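The missing-value check can be sketched like this on a toy dataframe (the real counts above come from the full dataset; the rows here are made up to show the nested NaN pattern):

```python
import pandas as pd
import numpy as np

factor_cols = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]

# Toy crashes: officers fill factors in order, so NaNs grow from column 1 to 5
df = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": ["Driver Inattention", "Unspecified",
                                      "Following Too Closely"],
    "CONTRIBUTING FACTOR VEHICLE 2": ["Unspecified", np.nan, np.nan],
    "CONTRIBUTING FACTOR VEHICLE 3": [np.nan, np.nan, np.nan],
    "CONTRIBUTING FACTOR VEHICLE 4": [np.nan, np.nan, np.nan],
    "CONTRIBUTING FACTOR VEHICLE 5": [np.nan, np.nan, np.nan],
})

nan_counts = df[factor_cols].isna().sum()
print(nan_counts.tolist())  # [0, 2, 3, 3, 3]: monotonically increasing

# Rows with at least one NaN equal rows where factor 5 is NaN,
# matching the 553736 observed in the full data
assert (df[factor_cols].isna().any(axis=1).sum()
        == df["CONTRIBUTING FACTOR VEHICLE 5"].isna().sum())
```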

From this insight, we can create two new columns: one with the number of vehicles in the accident, and another with the combined contributing factors.

Let's first clean the contributing factor columns and map them through a dictionary for easy combining and processing.

Now we will aggregate the information in the VEHICLE TYPE CODE and CONTRIBUTING FACTOR VEHICLE columns into two columns: one with the vehicle count and the other with the list of contributing factors.
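A sketch of this aggregation on toy data; the new column names `VEHICLE COUNT` and `CONTRIBUTING FACTORS` are our own illustrative choices:

```python
import pandas as pd
import numpy as np

factor_cols = [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in range(1, 6)]
vehicle_cols = [f"VEHICLE TYPE CODE {i}" for i in range(1, 6)]

df = pd.DataFrame({
    "CONTRIBUTING FACTOR VEHICLE 1": ["Driver Inattention", "Unspecified"],
    "CONTRIBUTING FACTOR VEHICLE 2": ["Unspecified", np.nan],
    "CONTRIBUTING FACTOR VEHICLE 3": [np.nan, np.nan],
    "CONTRIBUTING FACTOR VEHICLE 4": [np.nan, np.nan],
    "CONTRIBUTING FACTOR VEHICLE 5": [np.nan, np.nan],
    "VEHICLE TYPE CODE 1": ["Sedan", "Bike"],
    "VEHICLE TYPE CODE 2": ["SUV", np.nan],
    "VEHICLE TYPE CODE 3": [np.nan, np.nan],
    "VEHICLE TYPE CODE 4": [np.nan, np.nan],
    "VEHICLE TYPE CODE 5": [np.nan, np.nan],
})

# Vehicle count = how many of the five type-code slots are filled
df["VEHICLE COUNT"] = df[vehicle_cols].notna().sum(axis=1)
# Combined factors = the non-missing factor values collected into one list
df["CONTRIBUTING FACTORS"] = df[factor_cols].apply(
    lambda row: [v for v in row if pd.notna(v)], axis=1
)
print(df[["VEHICLE COUNT", "CONTRIBUTING FACTORS"]])
```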

Now we have the two new columns with vehicle count and combined contributing factors. By doing the above aggregation we were able to eliminate a lot of missing values and managed to condense the columns into useful information.

2.3: Reformat the data

To further increase ease of use, we pre-process the data to create columns like Year, Month, Day Of Week Number, Day Of Week, Hour of the day, Time of day as a fraction of the day passed, and Hour of the week, in a separate dataframe. This makes the plotting and processing of time series for different time periods easier, and using a new dataframe keeps the original one in mint condition.
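The derived time columns can be sketched as follows; the column names here are our choices and may not match the notebook exactly:

```python
import pandas as pd

df = pd.DataFrame({"CRASH DATE": pd.to_datetime(["2019-07-05 16:30:00",
                                                 "2020-03-22 02:15:00"])})
time_df = pd.DataFrame({
    "Year": df["CRASH DATE"].dt.year,
    "Month": df["CRASH DATE"].dt.month,
    "Day Of Week Number": df["CRASH DATE"].dt.dayofweek,  # Monday = 0
    "Day Of Week": df["CRASH DATE"].dt.day_name(),
    "Hour": df["CRASH DATE"].dt.hour,
})
# Time of day expressed as the fraction of the day that has passed
time_df["Day Fraction"] = (df["CRASH DATE"].dt.hour * 60
                           + df["CRASH DATE"].dt.minute) / (24 * 60)
print(time_df)
```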

2.4: Understanding the data

2.4.1. How Frequently Are Measurements Done?

The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf).

2.4.2 Pandas Profiling

Part 3: Data Analysis

In this section, we will analyse the data in more detail. We start by investigating the temporal patterns of the data, exploring the number of accidents by day of week, hour of the day, month, hour of the week, and year. Then we look at the distribution of accidents across different regions of the city on a heatmap. After that, we look at the accidents across districts in terms of how many people died or were injured, broken down into motorists, pedestrians and cyclists. We further provide an interactive visualization to explore the districts' death and injury tolls.

3.1. Investigating Temporal Patterns

In this first part of the data analysis, we'll look into the temporal patterns of the data to investigate questions like:

  1. At which days during the week do accidents happen the most?
  2. What hour of the day is most accident prone?
  3. Is there a month or season where accidents happen more?
  4. Has there been a change from 2018-2020? Has Covid-19 led to reduced accidents?

For the purpose of answering these questions, we use visualizations of aggregate accident counts over time to make patterns more noticeable. We found similar figures when plotting death/injury counts as well, reaffirming the common logic that more accidents lead to more people dying and getting injured; we show one such plot for comparison in the accidents-by-day-of-the-week section. We believe the visualizations make it easier to gauge patterns over time than trying to analyze a bunch of numbers.

3.1.1 Hourly Patterns

The first pattern we want to investigate is the hourly pattern during a day. We want to examine how many accidents happen at each hour of the day, aggregated over the entire three-year time frame, and see whether any patterns appear.

The visualization above reveals some interesting results and reaffirms a lot of what we were expecting. The number of accidents is low in the night hours, except for a minor peak at midnight. This is probably because people are sleeping and fewer people are on the roads. It can also be seen that the hours 16-17 are the global peak, when most accidents of the day happen. This could be due to several reasons: peak traffic, tiredness after a long day, a rush to get home, etc. We also found that, contrary to popular belief, night driving is generally much safer than day driving, as fewer accidents happen in the night hours, probably due to fewer people on the roads.

3.1.2 Weekly Patterns

In this section we explore the accident count on a weekly time scale. First we look at the number of accidents by day of week, aggregated over all years. Then we look at the accident count by the hour of the week in which it happened.

First, let's inspect the accident count by day of week:

Here we clearly see that fewer accidents happen on the weekends. The accident numbers are pretty consistent from Monday to Thursday, but peak on Friday. Our theory is that this is probably because people are most rushed on this day to get home from work. If this is true, we should see more accidents happening on Fridays between 15 and 17.

In order to test our theory and find some more weekly patterns, we split the week into 168 hours, starting from 00:00 on Monday as hour 0 and ending with midnight on Sunday.
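The mapping from weekday and hour to a week-hour index is simple arithmetic; a small sketch (the function name is ours):

```python
# Hour-of-week index: Monday 00:00 is hour 0, so for weekday w (Monday = 0)
# and hour h, the index is 24*w + h.
def hour_of_week(weekday, hour):
    return 24 * weekday + hour

print(hour_of_week(4, 16))  # 112 -> Friday at 16h
print(hour_of_week(6, 23))  # 167 -> the last hour of the week
```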

Let's look at the accident count by hour of the week:

Looking at the visualization above, we do see a peak of 5966 accidents in hour 112 of the week, which is 16h on Friday. The second maximum is also on Friday, at 17h, with 5924 accidents. The third maximum is on Thursday at 17h with 5812 accidents, followed by Thursday at 16h with 5733 accidents.

These results indicate that Friday evenings between 16 and 18 are the most dangerous times to drive, followed by the same hours on any working day. People should be especially careful and show more patience on the road, particularly on Fridays. The impatience and the high accident numbers stem from high traffic volumes, which reiterates the need to stay calm when under stress.

3.1.3 Monthly Patterns

Now we enlarge the time perspective a bit and look at the monthly and seasonal patterns. We want to address the question of whether the month or season has an impact on how many accidents there are. To analyze this, let's first look at the number of accidents by month of the year.

This visualization shows differences in accidents by month. Most accidents happen in January. There is a slight dip in February, but that could also be because February has fewer days. There is a significant dip in April, which is surprising. The end of the year, November and December, also has fewer accidents. This could be because fewer people drive in the winter snow.

Let's look at the accident count by week of the year to dig deeper into the monthly patterns:

From the visualization, it appears that the last two weeks of December are very low in accidents. This could be because of the holiday season, with people being happy, relaxed and more considerate of other drivers. The last week of the year has the lowest accident count by far, which could additionally be because people are at home with their families and there is less traffic on the roads.

We also observe a rise starting in the first week of January and peaking around mid-January. This could be a result of people being more ambitious about their new year goals and more rushed, among other reasons.

The most surprising is the dip around weeks 13-15. This could be attributed to spring and the Easter holidays. Another thing that might be skewing the numbers for April is the strict lockdown announced in April 2020 due to COVID-19. To investigate this further, we will look at yearly patterns in the next section.

3.1.4 Yearly Patterns

Finally, we want to investigate whether there are patterns in the number of accidents over the years. Here we are keen to look at the effect of COVID-19 and the lockdown on the number of accidents, in contrast with non-lockdown years. For this we first visualize the number of accidents over the three years to see if any pattern emerges.

From the plot we observe that there was a decrease in the number of accidents from 2018 to 2019, in line with a general increase in road safety over the years. Unsurprisingly, 2020 saw a huge drop, with the number of accidents almost falling by half. This was probably due to the COVID-19 pandemic, with lockdowns and people staying and working from home.

Let's investigate further by looking at monthly patterns over the years:

There is a lot to unpack in the plot above. Firstly, we can see that 2020, the COVID year, has relatively low activity over the entire year in comparison to the other two years. Secondly, it is important to note that there is always a surprising dip in the number of accidents in April; this could be attributed to the Easter holidays, but we are not sure. Thirdly, February sees a dip, probably due to its lower number of days, except in 2020, when the effect of the March lockdown reduced the accident numbers further. Lastly, and most importantly, there is a general decrease in the number of accidents month over month, which is a good sign, suggesting people are becoming more aware of road safety.

Finally, let's look at the number of accidents per week over the years to better see weekly deviations in accident patterns across these three years in New York:

From the above plot, we would firstly like to point out the dip starting around week 12 and bottoming out in week 15 of 2020. That is around the end of March and April. This was the time when details about the new virus were coming out and lockdowns were being imposed, so it is reassuring that the effect is visible in the data as well.

Secondly, we see a peak around weeks 25-26, which is about June, when the weather is good and more people are out on the roads. Even during the pandemic year, higher levels of activity can be seen in the summer months.

Lastly, we observe a dip in December, which is probably due to the holiday season. From these visualizations, it seems like the holidays are a good time for road safety. This is probably due to a more relaxed state of mind and a higher than usual desire to get to your loved ones safely.

3.2. Geographical Patterns

In this part of the data analysis, we'll look into the geographical patterns of the data in New York to investigate questions like:

  1. Are certain locations more likely to have accidents?
  2. Are certain boroughs more likely to have accidents?
  3. Do cyclists die or get injured more in some boroughs?
  4. Do motorists die or get injured more in some boroughs?
  5. Are pedestrians killed or injured more in some boroughs?

To answer these questions, we will use a heat map to look at the distribution of accidents across different regions of the city. After that, we look at the accidents across districts in terms of how many people died or were injured, broken down into motorists, pedestrians and cyclists. We further provide an interactive visualization to explore the districts' death and injury tolls.

We believe these visualizations make it easier for users to compare districts, depending on where they live, work or commute, and the mode of transport they use.

3.2.1. Distribution of accidents based on location.

The first pattern we want to investigate here is the distribution of accidents across the city. To examine this, we visualize a heat map based on the latitude and longitude of the accidents, aggregated over the entire time period; here we just want to see where accidents happen, setting the dimension of time aside for a bit. Let's look at the heat map in the cell below:

From the visualization above, it can be seen that most accidents happen in Manhattan. This is probably due to higher population density, more workplaces, more tourists, condensed housing, and rasher driving due to heavier traffic. As you move away from the city center, the accident density drops as well, in line with car and population densities. Let's move to the district-wise analysis in the next section to look deeper into the accident details of different regions.

3.2.2. District-Wise Accident Analysis

In this section, we compare the different boroughs of New York to see how they are doing in terms of how many people died or were injured, broken down into motorists, pedestrians and cyclists. This helps identify the right infrastructure to improve and the right safety measures to implement, through additional signs and campaigns targeted by user and mode of transport. Let's look at the visualization for the districts to better understand this question:

From the plot above, it can be seen that Brooklyn has the highest number of persons injured. Queens has a higher number of pedestrians killed, which makes it the leading district in people killed, followed by Brooklyn. Brooklyn has the highest numbers of cyclists killed, motorists killed and motorists injured, followed by Queens. Brooklyn also leads in cyclists injured, followed by Manhattan and Queens, and in pedestrians injured, followed by Queens and Manhattan.

From these observations, more road safety measures need to be implemented in Brooklyn for pedestrians, motorists and cyclists in general. Queens seriously needs to look into pedestrian safety, followed by measures for cyclist and motorist safety. Manhattan, though crowded, is still relatively safer, but new measures are needed to combat harm to cyclists and pedestrians.

In the next visualization, we provide an interactive visualization based on the choice of mode of transport.

3.3 Predictive model using accident data

In this section we want to predict the number of crashes. Using the accident data along with temporal information, we can hopefully gain a deeper understanding of what role, if any, time plays in accident patterns. We first perform a time series analysis: the number of crashes happening per day helps us identify the occurrence of crashes, and time factors (day of week, hour, season, etc.) help in analysing the behaviour of crashes with respect to time. We then build a machine learning regression model for predicting the crashes occurring per hour, which can also be used for forecasting.

3.3.1 Time Series Analysis

Analysing the number of crashes happening per day for a particular year helps us identify the occurrence of crashes, and time factors (day of week, hour, season, etc.) help in analysing the behaviour of crashes with respect to time. In this section, we build a machine learning regression model for predicting the crashes occurring per hour per day, which can also be used for forecasting.

First, the data required for the time series must be derived from the original data by extracting time features, followed by the creation of crashes/hr (the count of crashes occurring in a given hour of a given day), which is the target variable for prediction. This is done in the following steps.

Beyond the basic time features, further features are added: Holidays, whether the day was a holiday or not; Season, a categorical variable for the season of the month; and Lag, since an event occurring at a particular time depends on the preceding events, lagged values can be added as features in a time series.
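These extra features can be sketched as below; the holiday set and month-to-season mapping are illustrative assumptions, not necessarily the notebook's exact choices:

```python
import pandas as pd

hourly = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01 00:00", "2018-01-01 01:00",
                                 "2018-01-01 02:00", "2018-01-01 03:00",
                                 "2018-01-01 04:00", "2018-01-01 05:00"]),
    "crashes_per_hr": [5, 3, 2, 4, 6, 7],
})
holidays = {pd.Timestamp("2018-01-01")}  # e.g. New Year's Day
season_of_month = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}

# Holiday flag: compare the date part of each timestamp to the holiday set
hourly["is_holiday"] = hourly["timestamp"].dt.normalize().isin(holidays)
# Season: a categorical derived from the month
hourly["season"] = hourly["timestamp"].dt.month.map(season_of_month)
# Lag feature: the crash count one hour earlier (NaN for the first row)
hourly["lag_1"] = hourly["crashes_per_hr"].shift(1)
print(hourly[["is_holiday", "season", "lag_1"]].head(3))
```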

After creating the relevant features, the data is split into a train set (2018-01-01 to 2018-04-01) and a test set (the whole month of April 2018) for prediction. Further, the features are scaled, and the features and target variable are separated for training.
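The chronological split and scaling can be sketched on synthetic hourly counts; here we close the train window at the end of March and use a plain min-max scaling, which may differ from the scaler used in the notebook:

```python
import numpy as np
import pandas as pd

# Synthetic crashes/hr for Jan-Apr 2018 (Poisson stand-in for real counts)
idx = pd.date_range("2018-01-01", "2018-04-30 23:00", freq="h")
data = pd.DataFrame(
    {"crashes_per_hr": np.random.default_rng(0).poisson(5, len(idx))},
    index=idx,
)

# Chronological split: never shuffle a time series
train = data.loc[:"2018-03-31"]
test = data.loc["2018-04-01":]

# Min-max scaling fitted on the train set only, then reused on the test set
lo, hi = train["crashes_per_hr"].min(), train["crashes_per_hr"].max()
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)
print(len(train), len(test))  # 2160 train hours, 720 test hours
```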

An XGBRegressor model is used to build the predictor.

The R2 score is a measure used to evaluate the train and test sets in regression, a measure of goodness of fit. It is essentially the fraction of the total variance explained by the model, and tells how well the fitted model matches the data points.

MAPE (Mean Absolute Percentage Error) is a measure used to evaluate the test set in regression. The objective is to minimize the error occurring during training, so the model must have a low MAPE for good performance. It is the mean of the absolute difference between the predicted and true values, divided by the true value, multiplied by 100 to give a percentage.
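Both scores can be written out directly; this is a simplified sketch in plain NumPy, not the library implementations used in the notebook:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean of |error / true value|, as a percentage."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_true = [10, 20, 30]
y_pred = [12, 18, 33]
print(round(r2_score(y_true, y_pred), 3))  # 0.915
print(round(mape(y_true, y_pred), 2))      # 13.33
```

Note that MAPE is undefined when a true value is zero, which matters for quiet hours with no crashes.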

On the left side, the Quantile-Quantile (Q-Q) plots display the predicted over the observed values; the plots on the right overlay the predicted crashes with the observed ones over time. The performance of the XGBoost model can be further assessed with the help of these plots, which reflect the scores obtained in the previous part. In the next section, a cross-validation process is followed to increase the robustness of the prediction models and determine other features that could improve them.

3.3.2 Cross Validation

In this section we divide the dataset into different sets and validate the best model for prediction; this lets us compare different regression models and select the best one. To evaluate the performance of the different models, it is not enough to compare the scores from applying the prediction model defined above on the training sets. Cross-validation is needed to develop more robust models and to determine the features that should be used to better predict the crashes. The method used for time series data divides the original training set into smaller training and validation sets, or "windows", that slide in specific time steps from the initial to the final dates.
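The sliding-window scheme can be sketched as below. The "model" here is a naive mean predictor standing in for the regressors compared in the notebook, just to show how the windows move; the window sizes are arbitrary:

```python
import numpy as np

def sliding_windows(n, train_size, val_size, step):
    """Yield (train_idx, val_idx) index windows that slide forward in time."""
    start = 0
    while start + train_size + val_size <= n:
        yield (np.arange(start, start + train_size),
               np.arange(start + train_size, start + train_size + val_size))
        start += step

y = np.arange(1, 21, dtype=float)  # toy hourly crash counts
scores = []
for tr, va in sliding_windows(len(y), train_size=8, val_size=4, step=4):
    pred = y[tr].mean()  # "fit" on the window, predict its mean
    window_mape = np.mean(np.abs((y[va] - pred) / y[va])) * 100
    scores.append(window_mape)
print(len(scores))  # 3 windows fit into 20 points with these sizes
```

The mean of `scores` is then the cross-validated score used to rank the candidate models.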

The mean model performance scores obtained over these sets for the different models are collected and used to evaluate how well each model performs when predicting the target values.

After cross-validation across the different models, the XGBRegressor performs best. With the current setup, the linear regression model performs poorly; this is observed in the negative test set R2 score as well as the high MAPE. Conversely, XGBoost performs better, with R2 scores close to 1 for both sets and a MAPE of about 30%. This can also be observed in the table shown above.

3.3.3 Adding Weather data features

In order to make the prediction model more generic, weather data for the year 2018 has been incorporated, adding features such as Temperature, Precipitation, Snow, Snow Depth, Wind Speed, Wind Direction, Visibility, Cloud Cover and Relative Humidity.
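Joining daily weather onto the hourly crash counts can be sketched like this; the weather values are made up, and only a few of the columns listed above are shown:

```python
import pandas as pd

crashes = pd.DataFrame({
    "timestamp": pd.to_datetime(["2018-01-01 08:00", "2018-01-01 17:00",
                                 "2018-01-02 09:00"]),
    "crashes_per_hr": [4, 9, 6],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "Temperature": [-2.0, 1.5],
    "Precipitation": [0.0, 3.2],
    "Snow": [0.0, 1.1],
})

# Daily weather joins onto hourly crashes via the date part of the timestamp
crashes["date"] = crashes["timestamp"].dt.normalize()
merged = crashes.merge(weather, on="date", how="left")
print(merged.shape)  # (3, 6): both 1 Jan rows pick up the same weather
```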

On the left side, the Quantile-Quantile (Q-Q) plots display the predicted over the observed values; the plots on the right overlay the predicted crashes with the observed ones over time. Here a Random Forest model is used, and its performance with the weather features added can be assessed with the help of these plots, which reflect the scores obtained in the previous part.

Part 4: Genre

Selecting the right genre for a story depends on a variety of factors, including the complexity of the data, the complexity of the story, the intended audience and the intended medium.

Narrative visualisation offers various ways to present your data-driven story. There are seven genres of Narrative Visualization, namely magazine style, annotated chart, partitioned poster, flow chart, comic strip, slide show, and video.

In order to give readers the best experience and enable them to understand and investigate the analysis, we have used a combination of magazine style and annotated charts to narrate the learnings from the data, with additional interactivity and messaging elements. We chose this because it provides quick learning for an authority that wants to minimize the occurrence of crashes in a particular location. Our visualizations are supported by messaging and interactivity to give readers a better understanding by letting them explore the data themselves.


4.1 Visual structure

The first category of narrative structure is visual structuring which is the overall structure of the narrative. In this category we have used Consistent Visual Platform where the content of each section changes but the general layout of the visual elements is consistent. For highlighting, we have used features such as Zooming and for Transition Guidance we have used Animated Transitions.

4.2 Narrative Structure

For the narrative structure, linear ordering was used to make it easy for the user to digest all the information. Messaging is the use of text to provide observations and explanations about what is being presented, while interactivity gives the reader the ability to manipulate the visualisations and focus on specific data of interest. The visualisations and their respective interactivity were carefully chosen to engage the audience and enhance story discovery. We started with an introductory text and the motivation, then used captions and headlines for each section and summaries before or after the plots to present the findings and the points of interest. Thereby, the reader is met with appealing and informative visualisations, while the captions and headlines guide them through the story and help with the broader perspective. We also paid special attention to transitioning from one part to the next, to keep the reader's attention and consequently increase the overall memorability of the narrative. To achieve this, our focus was on switching smoothly between the different sections, where what is written in one part leads to the next, thus keeping the linear structure.

Part 5: Visualizations

5.1 Explain the visualizations you've chosen.

Various visualizations have been chosen to best communicate the story of accidents in New York. We start with the temporal analysis of the data. In the hourly patterns section, we present a bar plot of the number of accidents per hour aggregated over 3 years. We decided to use the interactivity that Plotly offers, as we also wanted to show the counts and other details on hover and let the viewer zoom into certain sections of the data.

We continued with more Plotly plots to show other temporal patterns over weeks, months and years. For weekly patterns, we used day-of-the-week and hour-of-the-week plots. For monthly patterns, we used month and week-number plots. For yearly patterns, we created subplots per year showing patterns over months and weeks to distinguish between the different years. All these plots were made in Plotly to provide interactivity to the user.

After that, in the geographical analysis section, we use a Folium heatmap to show the accident density over New York. Folium as a tool is ideal for showing geospatial data such as our accident locations, while giving the viewer a good understanding of the area we are investigating. By dragging the slider, the user sees the changes on the interactive map in the form of colored circles; the size of the circles indicates accident volume, with red marking high-density areas and blue marking low-density regions. At a quick glance, you can get an impression from the colors alone of where accidents happen.

In the next part, we use Plotly again to show subplots of the numbers of people killed and injured, motorists killed and injured, pedestrians killed and injured, and cyclists killed and injured, per district of New York. This visualization helps us identify the risks associated with different modes of transport in different districts. Together with a combined plot, where the user can switch between views of people killed, motorists killed, etc., with hover and zoom capability, it gives users a way to explore the data themselves based on their interests.


Lastly, to investigate the predictability of accidents from temporal patterns, we used different machine learning models for predictive analysis and found that the XGBoost model provides the best results. To show the performance of the XGBoost model, we used a Quantile-Quantile (Q-Q) plot to display the predicted values over the observed values. We also show the predicted crashes versus the actual crashes observed over time through a scatter plot to show the quality of our predictions.

In the next section, we add weather data to the dataset and make accident predictions using a Random Forest model. A Quantile-Quantile (Q-Q) plot is used to display the predicted values over the observed values, along with a scatter plot showing predicted versus actual crashes over time. The nice thing about plotting the comparison this way is that you can see how far the real values are from the predicted ones, and the colors provide good contrast to see where the values overlap. It clearly communicates the main point: whether the model works or not.

5.2 Why are they right for the story you want to tell?

Different visualization plots were used to communicate our story effectively. The story we want to tell is that there are temporal and geographic patterns to how accidents take place. Time of day, week, month and year, along with the weather patterns that come with time and geography, can be used as excellent predictors of accidents. The visualizations we used helped us tell this temporal and geographic story, while also showing the effectiveness of our machine learning models built on these temporal, geographic and weather features.

Part 6: Discussion

One of the great challenges with this dataset has been its many gaps. Some columns have so many missing values that they are completely unusable. Also, factors like the number of vehicles involved in an incident and the contributing factors were spread across multiple columns with many missing values, which made it harder to clean these columns and aggregate their information.

Another challenge we faced was making fair comparisons across time periods like month and weekday. For example, would it be fairer to compare the number of accidents or the number of deaths/injuries for a meaningful temporal comparison of the accident dataset? We found that both metrics displayed similar information empirically, which helped resolve this dilemma.

The missing values decreased our usable dataset, which led to more noise in our results, as expected. But we believe our dataset was large enough to cancel out the effects of noise in the hourly and weekly comparisons. We were very happy with the results of the yearly analysis: it showed the effects of COVID-19 on road accidents, along with a yearly decrease in accidents indicating growing awareness of road safety.

However, those results may still suffer from "biasing events" during a year. We explored the COVID-19 pandemic, but there could be other biasing events. For example, we did not go much deeper into the validity of the actual counts for individual days, or search for possibly biased date ranges. You might expect that when New York hosts a major event like the UN General Assembly, the number of accidents would rise due to increased traffic in the city. We found dips in accidents in April but, beyond the Easter holidays, could not attribute them to anything else. In future work, it would be very interesting to dive even deeper into the data and try to filter for anomalies which may arise from events like these.

One other interesting thing to explore in future work is the contributing factors.

Part 7: Contributions

  1. Motivation: Pranjal
  2. Basic Stats: Tala & Pranjal
  3. Data Analysis: Ananthu
  4. Genre: Tala
  5. Visualizations: Ananthu
  6. Discussion: Pranjal
  7. Web page: Tala

Created in Deepnote